A story in three parts


• A hard-to-track group (students who leave home
to study)

• A topical method for studying them (probabilistic
record linkage)

• New results (information not otherwise available)

A hard-to-track group


• Students who leave home to study should
register with a GP in the place where they are
studying (which in the present case is England
or Wales)

• If they do, their NHS Central Register (NHSCR)
records will be updated to show the move

• But how often do they register (and hence how
often do they leave their NHSCR records in
error)?

Two data sets needed


• The NHSCR extract lists the students with their
(sometimes wrong) postcodes

• The Higher Education Statistics Agency (HESA)
database of Scottish-domiciled students
studying in England or Wales lists the term-time
postcode

• Comparing the HESA postcode (which we
assume to be correct) with the NHSCR
postcode requires linking the two data sets

The HESA data set


47,549 Scottish-domiciled students in the 2007/8 and/or
2008/9 academic sessions

But it turned out that these were students studying at
universities located in England or Wales; many records
were of Open University (OU) students on distance
learning courses, etc.

No unique data trace for students located in England or
Wales

Records were removed if the term-time postcode was
Scottish (but not if it was missing), or if the date of
birth was before 1984 or after 1990 (i.e. they had to be
between 18 and 26 in 2008)
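
The removal rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and "Scottish" is approximated by postcode area (the leading letters), which is slightly imprecise because the DG and TD areas also cover a few English addresses.

```python
import re

# Postcode areas lying wholly or mainly in Scotland. Assumption: an
# area-level test is good enough (DG and TD straddle the border).
SCOTTISH_AREAS = {
    "AB", "DD", "DG", "EH", "FK", "G", "HS", "IV",
    "KA", "KW", "KY", "ML", "PA", "PH", "TD", "ZE",
}


def postcode_area(postcode):
    """Return the leading letters of a postcode, or None if missing or unparseable."""
    if not postcode:
        return None
    match = re.match(r"\s*([A-Za-z]{1,2})\d", postcode)
    return match.group(1).upper() if match else None


def keep_hesa_record(term_postcode, birth_year):
    """Apply the two removal rules: birth year window and Scottish term-time postcode."""
    if not 1984 <= birth_year <= 1990:
        return False
    # A missing postcode has area None, which is not in the set, so it is kept.
    return postcode_area(term_postcode) not in SCOTTISH_AREAS
```

Note that a record with a missing term-time postcode passes the filter, matching the "but not if it was missing" rule.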

The NHSCR data set


• 8,704,299 people on the extract in July 2010

• Sample was again limited to dates of birth from
1984 to 1990 (i.e. from 18 to 26 in 2008)

• Postcode valid on 1 September 2008 was
identified

The linkage procedure (1)


After these restrictions, 665,000 NHSCR records (7.6% of
the total) and 6,909 HESA records (14.5% of the total)
remained

The two packages used were Link Plus (US Centers for
Disease Control and Prevention) and Rec Link (US Census
Bureau)

Of the HESA records, 6,534 (94.6%) were
confidently linked to the NHSCR file

What does “confident” mean?

The linkage procedure (2)


Two types of error

Accepting false links (pairs that do not in fact refer to
the same person)

Not accepting true links (pairs that do in fact refer to
the same person)

Unavoidable tension between these two: reducing one
increases the other

Balance struck by the relative costs of the two
error types
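
The trade-off between the two error types can be sketched with match weights and two cut-offs, in the style of the Fellegi-Sunter model that underlies most probabilistic linkage. This is a minimal sketch, not the configuration used in this study: the m- and u-probabilities, the field names and the thresholds are all hypothetical values chosen for illustration.

```python
import math

# Hypothetical (m, u) pairs per field:
#   m = P(field agrees | records are the same person)
#   u = P(field agrees | records are different people)
FIELDS = {
    "last_name": (0.95, 0.01),
    "birthdate": (0.98, 0.003),
    "postcode": (0.70, 0.001),
}


def match_weight(agreements):
    """Sum log2 agreement or disagreement weights over the compared fields."""
    total = 0.0
    for field, agrees in agreements.items():
        m, u = FIELDS[field]
        total += math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))
    return total


def classify(weight, lower=0.0, upper=10.0):
    """Raising `upper` gives fewer false links but more missed true links."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "clerical review"


pair = {"last_name": False, "birthdate": True, "postcode": False}
print(classify(match_weight(pair)))
```

Pairs falling between the two thresholds are the ones sent for clerical review; moving either threshold shifts the balance between false links, missed links and review workload.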

The linkage procedure (3)


First      Middle  Last        Birthdate  Homepcode  Gen
ANNE       HELEN   ROBERTSON   19880624   PA78 6TA   F
ANNE       HELEN   HENDERSON   19880624   PA78 6SB   F

OLIVIA     KATE    ABRAN       19870211   EH4 1QX    F
OLIVIA             ABRAM       19870211   G13 8NN    F

JOHN               SMITH       19880503   KY14 3LR   M
JOHNSMITH                      19880503   KY14 3LR   M

HO         CHI     MINH        18900519   DG2 8RL    M
CHI-MINH           HO          18900519   DG2 8AN    M
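
Pairs like these defeat exact matching, so names are usually normalised and compared approximately. The sketch below uses Python's standard `difflib` for the comparison; the normalisation rules and function names are illustrative assumptions, not a description of what Link Plus or Rec Link actually do.

```python
import re
from difflib import SequenceMatcher


def name_tokens(full_name):
    """Upper-case the name, split on anything that is not a letter, and sort."""
    return sorted(re.findall(r"[A-Z]+", full_name.upper()))


def similarity(a, b):
    """Similarity in [0, 1] from difflib's SequenceMatcher."""
    return SequenceMatcher(None, a, b).ratio()


def compare_names(name_a, name_b):
    """Compare names ignoring token order, spacing and hyphens."""
    tokens_a, tokens_b = name_tokens(name_a), name_tokens(name_b)
    if tokens_a == tokens_b:
        return 1.0
    return similarity("".join(tokens_a), "".join(tokens_b))


for a, b in [
    ("ANNE HELEN ROBERTSON", "ANNE HELEN HENDERSON"),
    ("OLIVIA KATE ABRAN", "OLIVIA ABRAM"),
    ("JOHN SMITH", "JOHNSMITH"),
    ("HO CHI MINH", "CHI-MINH HO"),
]:
    print(f"{a!r} vs {b!r}: {compare_names(a, b):.2f}")
```

A high name score alone does not settle a pair: the first example agrees on date of birth and nearly on postcode yet differs in surname, which is why the other fields and clerical judgement still matter.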

The linkage result


• Of the HESA records, 6,534 (94.6%) were
confidently linked to the NHSCR file

• The 375 who could not be linked were more
likely to be of non-GB nationality (12% vs 4%)
and non-white ethnicity (19% vs 10%)

• Using only linked records for students following
first degrees and for whom complete data are
available leaves 3,893 records

3,893 first degree students domiciled in Scotland but studying in
England or Wales in 2007-9

Location of health board registration at end of academic year, by year of
study

                        Year of study
Region             1st     2nd     3rd     4th and beyond
Scotland           27%     19%     18%     14%
England / Wales    71%     79%     81%     85%
Other               2%      2%      1%      1%
Total            1,279   1,157     987     470
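
As a rough check on the table, the per-year shares can be combined into one overall figure. The sketch below weights the printed Scotland percentages by the column totals; because the percentages are already rounded in the source, the result is approximate, and the variable names are illustrative.

```python
# Percentages and totals as printed in the table (rounded in the source).
scotland_share = {"1st": 0.27, "2nd": 0.19, "3rd": 0.18, "4th+": 0.14}
totals = {"1st": 1279, "2nd": 1157, "3rd": 987, "4th+": 470}

n_students = sum(totals.values())
still_in_scotland = sum(scotland_share[year] * totals[year] for year in totals)
overall_share = still_in_scotland / n_students

print(f"{n_students} students, about {overall_share:.0%} still registered in Scotland")
```

The column totals sum to the 3,893 students in the slide title, and roughly one in five of them was still registered with a Scottish health board at the end of the academic year.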

Conclusions


Probabilistic record linkage offers the best way of
integrating data from disparate sources where there is no
shared unique identifier

It raises a wide range of issues, from highly technical
statistics to legislation and public perception

But it is not without its difficulties

Clerical review is time-consuming (and therefore
expensive), repetitive and requires constant concentration

But without it error rates increase

Even so, it is the future for data integration

• Opinions?
• Comments?
• Questions?
• Reactions?